January, 2018

Introduction

  • Day 1 - Getting started
  • Day 2 - Let's code
  • Day 3 - Tidyverse
  • Day 4 - Plotly
  • Day 5 - Shiny Introduction
  • Day 6 - Reactivity
  • Day 7 - Modules
  • Day 8 - Shiny Project

Day 1 - Getting started

Day 1 - Agenda

  • Installing R
  • Installing Packages
  • Finding Help

Installing R & RStudio

  • Go to your internet browser and download R
  • Also, download RStudio
  • Run the R exe and wait for it to finish installing
  • Once done, go ahead and install RStudio

Installing Git & TortoiseGit

  • Go to your internet browser and download Git
  • Also, download TortoiseGit
  • Install Git and then TortoiseGit

Set up Git proxy

Clone Git repo

Installing Packages

  • To install a package, run install.packages("package name")
  • Let's try it by installing the most important package we will be using later in the course: tidyverse
  • The tidyverse is a set of packages that make data science easier
  • install.packages("tidyverse")
  • You can also click the Install button on the Packages tab, located on the bottom right

Finding Help

  • To find help on a function or package that you have already installed, go to the Help tab on the bottom right and search for a package name or function in the search box
  • Alternatively, run ?function_name for a function in a loaded package, or ??search_term to search the help of all installed packages, e.g. ??tidyverse

Wrap up

  • Installing R
  • Installing Packages
  • Finding Help

Day 2 - Let's code

Day 2 - Agenda

  • Functions
  • Spark

Functions

Example 1 - Hello World

myFunction<-function(){
  print("Hello World")
}
myFunction()
## [1] "Hello World"

Functions

Example 2 - with inputs

myFunction<-function(a,b=2){
  total<-a+b
  return(total)
}

myFunction(1,1)
## [1] 2
myFunction(1)
## [1] 3

Functions

Example 3 - using titanic data and glm function to fit a logistic regression

install.packages("titanic")
library(titanic)

fit<-glm( 
  data = titanic_train,
  formula = Survived ~ Sex + Age + Pclass,
  family = "binomial"
)

Functions

Example 4 - use 'rio' package to read and write data from files

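install.packages("rio")
# Write titanic_train (from the titanic package loaded in Example 3) to a CSV file...
rio::export(x = titanic_train, file = "Data/titanic_train.csv")
# ...then read it back in as a tibble
data <- rio::import(file = "Data/titanic_train.csv", setclass = "tbl", integer64 = "double")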

Spark & db

  • When working with big data, use Spark
  • Spark is typically much faster than plain R on large data and can handle very large data sets
  • Note that not all R functions work in Spark

install.packages("sparklyr")
library(sparklyr)

spark_home_set("Spark/spark-2.2.1-bin-hadoop2.7")
sc<-spark_connect(master="local") # Create a connection to spark

data<-spark_read_csv(
  sc,
  "titanic",
  "Data/titanic_train.csv",
  memory = FALSE,
  overwrite = TRUE
)

# Copy a local R data frame (iris) into Spark
import_iris<-copy_to(sc,iris,"spark_iris",overwrite=TRUE)

Exercise 1

  1. Write a function (sim.pi) that takes one argument (iterations) with a default value of 1000
  2. Generate two vectors (x, y) of length iterations that are uniformly distributed between -1 and 1
  3. Test whether each coordinate pair falls inside the unit circle
    HINT: ifelse(x^2 + y^2 <= 1, TRUE, FALSE)
  4. Count how many of the coordinates fell inside the unit circle (inside; note that in is a reserved word in R and cannot be used as a name)
  5. Return 4*inside/iterations
  6. Congratulations, you estimated \(\pi\)!
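
One possible solution sketch for the steps above, using only base R. The counter is named inside here because in is a reserved word in R; the seed and the number of iterations in the usage lines are arbitrary choices for illustration.

```r
# Monte Carlo estimate of pi: the share of uniform points in the square
# [-1, 1] x [-1, 1] that land inside the unit circle approaches pi / 4.
sim.pi <- function(iterations = 1000) {
  x <- runif(iterations, min = -1, max = 1)
  y <- runif(iterations, min = -1, max = 1)
  inside <- sum(x^2 + y^2 <= 1)   # the comparison is already TRUE/FALSE
  4 * inside / iterations
}

set.seed(42)                      # arbitrary seed, only for reproducibility
estimate <- sim.pi(100000)
print(estimate)
```

More iterations give a more precise estimate; the error shrinks roughly with the square root of the number of iterations.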

Day 3 - Tidyverse

Day 3 - Agenda

  • select
  • filter
  • arrange
  • mutate
  • summarise
  • group_by
  • %>% (pipe)

select

It’s not uncommon to get data sets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

library(tidyverse) # loads dplyr (select) and ggplot2 (the diamonds data set)
select(diamonds, cut, color, carat, price)
select(diamonds, x:z)
select(diamonds, -(x:z))
select(diamonds, starts_with("c"))
select(diamonds, ends_with("e"))
select(diamonds, contains("r"))

TIP: Move sorting variables to the start of the data frame and only keep the important variables. Variables can be renamed at the same time.
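
A small sketch of the tip above. It uses mtcars, which ships with R, instead of diamonds so that it runs with only dplyr loaded; the new column name cylinders is made up for illustration.

```r
library(dplyr)

# Keep three columns and rename one of them in the same call
renamed <- select(mtcars, cylinders = cyl, mpg, hp)
names(renamed)

# Move a sorting variable to the front and keep everything else
reordered <- select(mtcars, hp, everything())
names(reordered)
```

Use rename() instead when you want to rename a column but keep all the others.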

filter

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

filter(diamonds, cut=="Ideal")
filter(diamonds, cut!="Ideal")
filter(diamonds, carat>=4) # <, >, ==, !=, <=, >=
filter(diamonds, cut=="Ideal" & carat>=4 )
filter(diamonds, cut=="Ideal" | carat>=4 )
filter(diamonds, cut %in% c("Ideal","Premium"))

sqrt(2)^2 == 2      # FALSE: floating point arithmetic is not exact
near(sqrt(2)^2, 2)  # TRUE: near() compares within a small tolerance

arrange

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

arrange(diamonds, cut) # cut is an ordered factor: Fair to Ideal
arrange(diamonds, desc(cut)) # Ideal to Fair

arrange(diamonds, price) # Small to large
arrange(diamonds, desc(price)) # Large to small

arrange(diamonds, cut, desc(price)) # by two or more variables

mutate

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

  • Arithmetic: +, -, *, /, ^
  • Modular: %/% (integer division), %% (remainder), which satisfy x == y * (x %/% y) + (x %% y)
  • Logs: log(), log2(), log10()
  • Offsets: lead(), lag()
  • Cumulatives: cumsum(), cumprod(), cummin(), cummax(). See the RcppRoll package for rolling aggregates.
  • Logical: <, <=, >, >=, !=, ==
  • Ranking: min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
  • User defined: function(){} – should be a vectorised function
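
A quick base R check of the modular arithmetic identity from the list above, on a few made-up values (including a negative one, where the remainder takes the sign of the divisor):

```r
x <- c(7, 10, -7)
y <- 3

quotient  <- x %/% y   # integer division
remainder <- x %% y    # remainder

print(quotient)
print(remainder)

# Rebuild x from its quotient and remainder
rebuilt <- y * quotient + remainder
print(all(rebuilt == x))
```

Both operators are vectorised, so they can be used directly inside mutate().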

mutate

TIP: Arithmetic operators are useful in conjunction with aggregate functions, e.g. x/sum(x) gives the proportion, and y-mean(y) computes the difference from the mean.
TIP: Offsets allow you to compute running differences (e.g. x-lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), but make sure to sort first using arrange().

mutate(
  diamonds,
  price_p_carat = price / carat,
  diff = price_p_carat - mean(price_p_carat),
  z_score = diff / sd(price_p_carat)
)
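
A minimal sketch of the offsets tip: sort first, then group, then use lag() inside mutate(). The sales data frame and its column names are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data, deliberately unsorted
sales <- data.frame(
  shop  = c("a", "a", "a", "b", "b", "b"),
  month = c(2, 1, 3, 1, 3, 2),
  units = c(12, 10, 15, 7, 9, 4)
)

changes <- sales %>%
  arrange(shop, month) %>%          # sort so that lag() means "previous month"
  group_by(shop) %>%                # restart the lag for every shop
  mutate(
    diff      = units - lag(units), # running difference, NA for the first month
    went_down = units < lag(units)  # flag months where units dropped
  ) %>%
  ungroup()

print(changes)
```

Without group_by(), the first month of shop b would be compared with the last month of shop a.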

summarise

The last key verb is summarise(). It collapses a data frame to a single row. summarise() is not terribly useful unless we pair it with group_by().

  • TIP: There are many built-in functions so don’t reinvent the wheel.
  • TIP: The result of a summary can be used directly in the next step to calculate other statistics.
  • WARNING: Remember when calculating statistics that the result is not always what you would expect, e.g. mean() returns the unweighted average; use weighted.mean() for a weighted one.
  • WARNING: Always check the documentation before using built-in functions to know what options there are and what the defaults are. It is important to understand exactly what you are calculating.
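
The two warnings above, illustrated in base R on made-up numbers: the unweighted and weighted averages differ, and the default na.rm = FALSE makes mean() return NA as soon as one value is missing.

```r
price  <- c(10, 20, 30)
weight <- c(3, 2, 1)

plain    <- mean(price)                   # every value counts equally
weighted <- weighted.mean(price, weight)  # (3*10 + 2*20 + 1*30) / 6

with_na  <- mean(c(1, NA, 3))                # default: NA propagates
no_na    <- mean(c(1, NA, 3), na.rm = TRUE)  # missing values dropped first

print(c(plain = plain, weighted = weighted, with_na = with_na, no_na = no_na))
```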

summarise

summarise(
  diamonds,
  N = n(),
  sum = sum(price),
  ave1 = sum / N,
  SSD = sum( (price - mean(price)) ^2),
  SD = sqrt( SSD / (n() -1) )
)
## # A tibble: 1 x 5
##       N       sum   ave1          SSD      SD
##   <int>     <int>  <dbl>        <dbl>   <dbl>
## 1 53940 212135217 3932.8 858473135517 3989.44

group_by

group_by() changes the unit of analysis from the complete data frame to individual groups. When you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

TIP: group_by() is useful when calculating statistics per group. These statistics can then be easily compared.

TIP: Complicated models can also be built and then run on a group-by-group basis.

WARNING: When using group_by() with summarise(), each summarise() peels off the last grouping variable. That means if you group by Var1 and Var2, the data frame will only be grouped by Var1 after the summary. Thus the order of the variables used in group_by() matters.
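
A short sketch of the warning above, using mtcars (which ships with R) and group_vars() to inspect the grouping after each step. Recent dplyr versions also print a message about the remaining grouping.

```r
library(dplyr)

by_two <- group_by(mtcars, cyl, gear)
group_vars(by_two)                    # grouped by both variables

once <- summarise(by_two, n = n())    # one row per cyl/gear combination
group_vars(once)                      # only the first variable is left

twice <- summarise(once, n = sum(n))  # one row per cyl
group_vars(twice)                     # no grouping left
```

Call ungroup() when you want to be sure that later steps work on the whole data frame.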

group_by

diamonds_grouped <- group_by(diamonds,cut)
summarise(
  diamonds_grouped,
  N = n(),
  average = mean(price),
  SD = sd(price)
)
## # A tibble: 5 x 4
##         cut     N  average       SD
##       <ord> <int>    <dbl>    <dbl>
## 1      Fair  1610 4358.758 3560.387
## 2      Good  4906 3928.864 3681.590
## 3 Very Good 12082 3981.760 3935.862
## 4   Premium 13791 4584.258 4349.205
## 5     Ideal 21551 3457.542 3808.401

%>% (pipe)

%>% is used to string functions together: x %>% f(y) is equivalent to f(x, y). This lets a sequence of steps be read from top to bottom, which keeps the logic clear and concise.
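
To see what the pipe buys you, here is the same calculation written twice, once as nested calls and once with %>%. The example uses mtcars, which ships with R, so only dplyr is needed; the column name avg_mpg is made up for illustration.

```r
library(dplyr)

# Nested calls: must be read from the inside out
nested <- arrange(
  summarise(group_by(mtcars, cyl), avg_mpg = mean(mpg)),
  desc(avg_mpg)
)

# Piped: reads top to bottom, one step per line
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))

print(piped)
```

Both versions return the same result; the pipe simply passes the left-hand side as the first argument of the next function.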

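# plot_ly() and layout() come from the plotly package; pal_deloitte is a colour
# palette that is assumed to be defined elsewhere in the course material
diamonds %>%
  group_by(color, clarity) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  plot_ly(x = ~color, y = ~prop, color = ~clarity, type = "bar", colors = pal_deloitte) %>%
  layout(barmode = "stack")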

%>% (pipe)

Exercise

Using the transition data, calculate the transition matrix for each segment. A transition rate is defined as:

\[p_{ij}=Pr(X_{t+1}=j \mid X_{t}=i)\]

It is estimated as the balance-weighted share of accounts in state \(i\) at time \(t\) that are in state \(j\) at time \(t+1\):

\[p_{ij}=\frac{\sum_n balance_{n,t} \times I(X_{n,t+1}=j,\ X_{n,t}=i)}{\sum_n balance_{n,t} \times I(X_{n,t}=i)}\]

HINT: Make sure that each row sums to one.
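
A possible approach, sketched on a toy data frame because the course's transition data is not shown here. The column names (segment, from, to, balance) and all values are hypothetical; adapt them to the real data.

```r
library(dplyr)

# Hypothetical toy data: one row per account, with its state at t (from),
# its state at t+1 (to) and its balance at t
transitions <- data.frame(
  segment = c("A", "A", "A", "A", "A", "B", "B"),
  from    = c(1, 1, 1, 2, 2, 1, 1),
  to      = c(1, 2, 2, 2, 1, 1, 1),
  balance = c(100, 100, 200, 50, 150, 10, 30)
)

trans_matrix <- transitions %>%
  group_by(segment, from, to) %>%
  summarise(balance = sum(balance)) %>%  # numerator; grouping by "to" is peeled off
  mutate(p = balance / sum(balance)) %>% # divide by the total per segment and from-state
  ungroup()

print(trans_matrix)
```

The result is in long format, one row per segment, from-state and to-state; the rates of each segment and from-state add up to one, as the hint requires.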

Day 4 - Plotly

Day 4 - Agenda

  • Scatter plots
  • Line plots
  • Bar charts
  • Heatmaps
  • Box plots
  • Histograms

Scatter plots

diamonds%>%
  dplyr::sample_n(1000)%>%
  plot_ly(colors = pal_deloitte)%>%
  add_markers(
    x = ~carat, 
    y = ~price, 
    color = ~color,
    size = ~carat, 
    text = ~paste("Clarity: ", clarity)
  )

Line plots

economics_long%>%
  plot_ly(
    x=~date,
    y=~value,
    color=~variable,
    colors = pal_deloitte,
    type="scatter",
    mode="lines"
  )

Bar charts

diamonds %>% 
  count(cut, clarity) %>%
  plot_ly(colors = pal_deloitte)%>%
  add_bars(
    x = ~cut, 
    y = ~n, 
    color = ~clarity
  )

Bar charts

diamonds%>%
  group_by(color,clarity)%>%
  summarise(n=n())%>%
  mutate(
    nn=sum(n),
    prop=n/nn
  )%>%
  plot_ly(x = ~color,colors = pal_deloitte)%>%
  add_bars(
    y = ~prop, 
    color = ~clarity
  ) %>%
  layout(barmode = "stack")

Heatmaps

diamonds%>%
  group_by(cut, clarity) %>%
  summarise(N=n())%>%
  plot_ly() %>%
  add_heatmap( 
    x = ~cut, 
    y = ~clarity, 
    z =~N
  )

Box plots

diamonds%>%
  plot_ly(colors = pal_deloitte)%>%
  add_boxplot(
    x = ~cut, 
    y = ~price, 
    color = ~clarity
  ) %>%
  layout(boxmode = "group")

Histograms

plot_ly(alpha = 0.6,colors = pal_deloitte) %>%
  add_histogram(x = ~rnorm(500)) %>%
  add_histogram(x = ~rnorm(500) + 1) %>%
  layout(barmode = "overlay")

Histograms

plot_ly(alpha = 0.6,colors = pal_deloitte) %>%
  add_histogram(x = ~rnorm(500)) %>%
  add_histogram(x = ~rnorm(500) + 1) %>%
  layout(barmode = "stack")

Histograms

diamonds%>%
  plot_ly(colors = pal_deloitte2)%>%
  add_histogram2d(x = ~carat, y = ~price)

Histograms

diamonds%>%
  plot_ly(colors = pal_deloitte2)%>%
  add_histogram2dcontour(x = ~carat, y = ~price)
